An air quality index prediction model based on CNN-ILSTM

Air quality index (AQI) is an essential measure for evaluating air pollution: it describes the degree of air pollution and its impact on health, so accurate AQI prediction is significant. This paper presents an AQI prediction model based on Convolutional Neural Networks (CNN) and Improved Long Short-Term Memory (ILSTM), named CNN-ILSTM. ILSTM deletes the output gate of LSTM, improves its input gate and forget gate, and introduces a Conversion Information Module (CIM) to prevent supersaturation in the learning process. ILSTM learns historical data efficiently, improves prediction accuracy, and reduces training time. CNN effectively extracts features from the input data. This paper uses air quality data from 00:00 on January 1, 2017, to 23:00 on June 30, 2021, in Shijiazhuang City, Hebei Province, China, as the experimental data set, and compares the model with eight prediction models (SVR, RFR, MLP, LSTM, GRU, ILSTM, CNN-LSTM, and CNN-GRU) to demonstrate the validity and accuracy of the CNN-ILSTM prediction model. The experimental results show that the MAE of CNN-ILSTM is 8.4134, the MSE is 202.1923, the R² is 0.9601, and the training time is 85.3 s. In this experiment, CNN-ILSTM performs better than the other models.

1. Through the study of RNN, LSTM, and time-series data, this paper presents an improved LSTM network, ILSTM, which deletes the output gate of LSTM, improves the input gate and forget gate of LSTM, and introduces a CIM to prevent supersaturation in the learning process.
2. Compared with LSTM, the ILSTM proposed in this paper has fewer parameters, lower computational complexity, and less training time, on the premise of ensuring the prediction accuracy of AQI.
3. This paper presents an AQI prediction model based on CNN-ILSTM. The introduction of CNN extracts the features of the input data well. Comparative experiments show that the combination of CNN and ILSTM improves the accuracy of AQI prediction, and that the AQI prediction result of CNN-ILSTM is better than that of the other eight prediction models.

Related work
Traditional regression models for time series prediction include Random Forest Regression (RFR), Support Vector Regression (SVR), and Multi-Layer Perceptron (MLP). Ganesh et al. used SVR to forecast the AQI of Delhi and Houston [25]. However, because of the unstable characteristics of AQI data, it was difficult for SVR to achieve a high fitting degree. Zhang et al. proposed an RFR based on Spark clustering for air quality prediction [26], but for nonlinear air quality data the RFR carried the risk of over-fitting. Duro et al. used MLP to forecast the concentrations of PM10 and O₃ in industrial areas [27]. However, for many non-stationary time series, the traditional MLP prediction model had a low fitting degree. RNN significantly improved the fitting degree of time series prediction. Unlike standard neural networks, the result of each RNN hidden layer depends on both the current input and the hidden layer's result at the previous step, so the RNN retains a memory of previous results. For example, Wang used RNN to predict air quality [28]. Because of the RNN's long-term dependence on data, the issues of "gradient explosion" and "gradient disappearance" appear during model training [29].
The gated technology has, to a great extent, alleviated the issues of "gradient explosion" and "gradient disappearance" caused by the RNN's long-term dependence on data. For example, Ysc et al. used LSTM to forecast changes in air pollutants [30]. With a single prediction model, however, feature extraction was often insufficient, making it difficult to achieve high-precision prediction. Dsa …

Models
CNN. Compared with the traditional neural network model, CNN has some unique advantages. For example, as the number of hidden layers and nodes of a traditional neural network increases, its weight parameters W and bias parameters B gradually increase, and so does the amount of calculation. CNN, by contrast, realizes parameter sharing, so the amount of calculation is greatly reduced [35,36], as shown in Fig. 1. CNN can handle more complex data environments and problems with unclear data background and unclear inference rules, and it tolerates larger defects and distortions in the samples [37][38][39][40]. CNN can also realize feature extraction of local signals well, and the combination of CNN with RNN and LSTM has been widely used in feature extraction of time series data [41][42][43]. Therefore, CNN can effectively extract features from non-linear and unstable air quality data.
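The effect of parameter sharing can be illustrated with a small parameter count. The sketch below compares a fully connected layer with a one-dimensional convolution; the layer sizes (23 time steps, 7 features, 32 filters, kernel size 3) are hypothetical values chosen for illustration and are not the settings of this paper.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv1d_params(kernel_size: int, in_channels: int, filters: int) -> int:
    """Weights plus biases of a 1-D convolution.

    The same kernel is reused at every time step, so the count does not
    depend on the sequence length.
    """
    return kernel_size * in_channels * filters + filters


# Hypothetical sizes (assumptions, not taken from the paper).
steps, features, filters, kernel_size = 23, 7, 32, 3

# A dense layer mapping the flattened window to `filters` values per step.
dense = dense_params(steps * features, steps * filters)
conv = conv1d_params(kernel_size, features, filters)
print("dense:", dense, "conv1d:", conv)
```

Because the convolution kernel is shared across time steps, lengthening the input window leaves the convolution's parameter count unchanged, whereas the dense layer grows with both its input and output sizes.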
ILSTM. Model structure. LSTM has excellent advantages in mining long-term dependence relationships in sequence data. Figure 2 shows its three types of gates: the forget gate, the input gate, and the output gate. A gate can be regarded as a fully connected layer, and LSTM stores and updates information through these gates [44,45]. Gated Recurrent Unit (GRU) has only two gates. GRU combines the input gate and forget gate of LSTM into one, which is called the update gate [46], as shown in Fig. 3. Based on the gated technology, the ILSTM model proposed in this paper consists of an input gate and a forget gate, as shown in Fig. 4. Compared with LSTM, ILSTM deletes the output gate. Compared with GRU, the ILSTM structure is simpler. The parameters of LSTM, GRU, and ILSTM are shown in Table 1. Compared with LSTM, ILSTM reduces the weight parameters from 8 to 4 and the bias parameters from 4 to 2. Compared with GRU, ILSTM reduces the weight parameters from 6 to 4 and the bias parameters from 3 to 2.
In terms of algorithm, ILSTM adds the cell state $c_{t-1}$ of the previous moment to the algorithm of the forget gate to generate the mainline forgetting $k_t$, which affects the degree of data retention at the current time. In addition, when updating the cell state $c_t$ of the current moment, the CIM is introduced to prevent supersaturation in the learning process.
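The matrix counts above translate into scalar parameter counts once the layer sizes are fixed. The following sketch assumes that every gate block holds one recurrent weight matrix, one input weight matrix, and one bias vector (4 blocks for LSTM, 3 for GRU, 2 for ILSTM, matching the 8/6/4 weight matrices above); the input dimension and number of units are hypothetical.

```python
# Number of (recurrent weight, input weight, bias) blocks per model,
# matching the 8/6/4 weight matrices and 4/3/2 biases stated in the text.
BLOCKS = {"LSTM": 4, "GRU": 3, "ILSTM": 2}


def recurrent_params(model: str, input_dim: int, units: int) -> int:
    """Scalar parameter count under the one-block-per-gate assumption."""
    per_block = units * units + units * input_dim + units
    return BLOCKS[model] * per_block


# Hypothetical sizes (assumptions): 7 input features, 64 units.
for name in BLOCKS:
    print(name, recurrent_params(name, input_dim=7, units=64))
```

Under this assumption ILSTM carries half the parameters of an LSTM layer of the same width and two thirds of those of a GRU layer.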
The forget gate $f_t$ is a crucial component of the ILSTM unit; it controls which information is retained and which is forgotten. $\sigma(x)$ is the Sigmoid function, as shown in formula (1). $x_t$ is the input data at the $t$-th time step, and $h_{t-1}$ is the hidden layer at the previous time step $t-1$. $W_{fh}$ is the forget gate's weight for $h_{t-1}$, $W_{fx}$ is its weight for $x_t$, and $b_f$ is the bias of the forget gate, as shown in formula (2).

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

$$f_t = \sigma\left(W_{fh} h_{t-1} + W_{fx} x_t + b_f\right) \tag{2}$$
Mainline forgetting $k_t$ is calculated from the cell state $c_{t-1}$ and $f_t$, as shown in formula (3). It represents the influence of past information on the current cell state $c_t$, where $c_{t-1}$ is the cell state information accumulated from the beginning to the previous moment.

$$k_t = f_t \times c_{t-1} \tag{3}$$

The input gate $i_t$ controls how much of the current input data $x_t$ flows into the memory cell, that is, how much can be saved to $c_t$. Compared with the input gate of LSTM, ILSTM adds $c_{t-1}$, the cell state information up to the previous moment, to the input gate algorithm. The introduction of $c_{t-1}$ gives the input gate a memory effect on the retention of data at the current time, as shown in formula (4), where $W_{ih}$ and $W_{ix}$ are the input gate's weights for $h_{t-1}$ and $x_t$, respectively, and $b_i$ is the bias of the input gate.
Due to the characteristics of the Sigmoid activation function, when the value of $x$ lies outside the range from −3 to 3, the Sigmoid activation function falls into a supersaturation interval. Therefore, in formula (4), when the input data drives the input gate into supersaturation, the output value no longer changes significantly, which decreases learning sensitivity. A CIM is introduced into the ILSTM model to prevent this phenomenon, as shown in formula (5): the Sigmoid function value (ranging from 0 to 1) calculated by formula (4) is taken as the input of Tanh. The Tanh and Sigmoid functions are shown in Fig. 5.

$$\mathrm{CIM} = \tanh(i_t) \tag{5}$$
The value of $\tanh(i_t)$ lies in the interval (0, 0.762), as shown by the dotted part of the tanh curve in Fig. 5, so the obtained values are more uniform and more clearly separated. Therefore, the value output by the CIM greatly reduces the degree of supersaturation, and the larger differences make the model's calculations more distinguishable, which makes model learning more sensitive.
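The range of the CIM output can be checked numerically. This is a minimal sketch that only evaluates the Sigmoid function and the composition tanh(Sigmoid(x)) at a few sample points; the sample points are arbitrary, and the helper name `cim` is introduced here for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cim(x: float) -> float:
    """tanh applied to a Sigmoid output, as in the CIM."""
    return math.tanh(sigmoid(x))


# Arbitrary sample points, including values outside the range -3..3.
for x in (-6, -3, 0, 3, 6):
    print(f"x={x:>3}  sigmoid={sigmoid(x):.4f}  tanh(sigmoid)={cim(x):.4f}")

# Upper bound of the CIM output: tanh(1), reached only in the limit.
print("tanh(1) =", round(math.tanh(1.0), 4))
```

Since the Sigmoid output stays between 0 and 1, the composed value stays between 0 and tanh(1), which is approximately 0.762.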
Formula (6) shows that $c_t$ is the information kept from the beginning to the present. $h_t$ indicates the information preserved at the current time: through the tanh function, $c_t$ controls how much information can be kept, as shown in formula (7).
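One ILSTM time step can be sketched in plain Python. The forget gate and mainline forgetting follow the definitions given in the text. Three points are assumptions made only for this illustration, because the corresponding formulas are not reproduced here: that $c_{t-1}$ enters the input gate additively, that $c_t = k_t + \mathrm{CIM}$, and that $h_t = \tanh(c_t)$. The function and dictionary key names are hypothetical.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def ilstm_step(x_t, h_prev, c_prev, p):
    """One ILSTM step; p holds W_fh, W_fx, b_f, W_ih, W_ix, b_i."""
    # Forget gate from h_{t-1} and x_t.
    rec_f, inp_f = matvec(p["W_fh"], h_prev), matvec(p["W_fx"], x_t)
    f_t = [sigmoid(r + s + b) for r, s, b in zip(rec_f, inp_f, p["b_f"])]
    # Mainline forgetting: k_t = f_t * c_{t-1}.
    k_t = [f * c for f, c in zip(f_t, c_prev)]
    # ASSUMPTION: c_{t-1} is added inside the input gate's Sigmoid.
    rec_i, inp_i = matvec(p["W_ih"], h_prev), matvec(p["W_ix"], x_t)
    i_t = [sigmoid(r + s + c + b)
           for r, s, c, b in zip(rec_i, inp_i, c_prev, p["b_i"])]
    # CIM = tanh(i_t).
    cim = [math.tanh(i) for i in i_t]
    # ASSUMPTION: the new cell state is k_t + CIM.
    c_t = [k + m for k, m in zip(k_t, cim)]
    # ASSUMPTION: the hidden state is tanh(c_t); there is no output gate.
    h_t = [math.tanh(c) for c in c_t]
    return h_t, c_t


# Toy usage: 2 input features, 1 unit, all parameters zero.
params = {"W_fh": [[0.0]], "W_fx": [[0.0, 0.0]], "b_f": [0.0],
          "W_ih": [[0.0]], "W_ix": [[0.0, 0.0]], "b_i": [0.0]}
h_1, c_1 = ilstm_step([1.0, 2.0], [0.0], [0.0], params)
print(h_1, c_1)
```

With zero parameters and a zero initial state, both gates output 0.5, so the cell state equals tanh(0.5) after one step.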
Formula derivation of ILSTM. ILSTM is proposed to improve the model's prediction accuracy and reduce its training time, on the premise that the model can alleviate the "gradient explosion" and "gradient disappearance" issues of the RNN. The input gate and forget gate use two parameter matrices [$W_{fh}$, …]. $L_t$, a function of $W$, is the loss corresponding to $h_t$, and $L$ is the total loss. The derivative of $L$ with respect to $W$ is shown in formula (8). The RNN updates the parameter $W$ by formula (9), where $\partial L_t / \partial W$ can be written as formula (10). Formula (10) can be simplified to formula (11), where $c_t$ is shown in formula (12). Since $\mathrm{CIM} = \tanh(i_t)$, formula (12) can be written as formula (13). The derivative of $c_t$ is obtained by formula (14), and the total loss can then be written as formula (15). Let $x$ be given by formula (16), so that $f_t$ can be written as formula (17), and let $y$ be given by formula (18), so that $i_t$ can be written as formula (19). Formula (15) can then be written as formula (20). Let $z(x, y)$ be given by formula (21); formula (20) can then be written as formula (22).
As shown in formula (22), the gradient of the function is $\frac{\partial L_k}{\partial h_k}\frac{\partial h_k}{\partial c_k}\prod_{t=2}^{k} z(x, y)\,\frac{\partial c_1}{\partial W}$. When $z(x, y)$ is greater than 1, the gradient may become too large as the amount of data increases. When $z(x, y)$ is too small, the gradient disappears easily.
In this model, the $\sigma(x)$ function is shown in Fig. 6, and the $\left(1 - \tanh^2(\sigma(y))\right)\sigma'(y)$ function is shown in Fig. 8. It can be seen from the figures that the value range of the gradient factor $z(x, y)$ is more reasonable. Therefore, this model can alleviate the problems of "gradient disappearance" and "gradient explosion" to a great extent.
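The value range of the gradient factor can be explored numerically. The sketch below assumes that $z(x, y)$ is the sum of the two functions named above, $\sigma(x)$ and $(1 - \tanh^2(\sigma(y)))\,\sigma'(y)$; this combination is an assumption made for illustration, and the grid from −10 to 10 is arbitrary.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def sigmoid_grad(v: float) -> float:
    s = sigmoid(v)
    return s * (1.0 - s)


def z(x: float, y: float) -> float:
    """ASSUMED gradient factor: sigma(x) + (1 - tanh^2(sigma(y))) * sigma'(y)."""
    return sigmoid(x) + (1.0 - math.tanh(sigmoid(y)) ** 2) * sigmoid_grad(y)


# Evaluate on an arbitrary grid from -10 to 10 in steps of 0.1.
grid = [i / 10 for i in range(-100, 101)]
values = [z(x, y) for x in grid for y in grid]
z_min, z_max = min(values), max(values)
print("min:", z_min, "max:", z_max)
```

Under this assumption the factor is always positive and cannot exceed 1.25, because $\sigma(x) < 1$ and $\sigma'(y) \le 0.25$, so repeated multiplication grows or shrinks more slowly than with an unbounded factor.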

CNN-ILSTM.
The structure of CNN-ILSTM is shown in Fig. 9. The CNN-ILSTM model is generally divided into four parts. The first layer is the data input layer: this paper takes AQI as the research object and air quality data as the model input. The second layer is the data preprocessing layer: to ensure the reliability and improve the accuracy of the prediction results, the original data must undergo standardization, three-dimensional time series construction, and other pre-processing operations. The third layer is the feature extraction layer, which realizes feature extraction of air quality data by taking advantage of …

Data collection and preprocessing. The air quality data used in this experiment are obtained from http://data.epmap.org/. The original data often contain problems such as missing and duplicated records. Some of the original air quality data are shown in Table 2.
In this experiment, the original data are processed as follows:
1. Delete duplicates. The original data contain duplicates; for example, there are duplicate records at 3:00 on February 1, 2021 and 3:21 on February 1, 2021. In this experiment, the last record is kept and the earlier duplicates are deleted.
2. Data filling. During air quality monitoring, data may be lost because of network interruption, storage failure, and other reasons, as with the data at 1:00 on February 1, 2021. Such missing values degrade the model's learning and lower the final prediction accuracy. Considering that air pollution data change smoothly with time in most cases, generally without sudden changes, this experiment fills each missing value with the average of the values one hour before and one hour after [47], as shown in formula (23).

$$V_t = \frac{V_{t-1} + V_{t+1}}{2} \tag{23}$$

where $V_{t-1}$ is the data of one hour before time $t$, $V_t$ is the missing value at time $t$, and $V_{t+1}$ is the data of one hour after time $t$.
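The two cleaning steps can be sketched as follows. The record layout (an hour label paired with a value) and the function names are hypothetical, and treating records that fall within the same hour as duplicates is an assumption; only isolated missing values are filled, since the neighbour average needs both neighbours to be present.

```python
def drop_duplicates_keep_last(records):
    """records: (hour_label, value) pairs in arrival order; the last one wins."""
    latest = {}
    for hour, value in records:
        latest[hour] = value  # a later record overwrites an earlier duplicate
    return sorted(latest.items())


def fill_missing(values):
    """Fill an isolated None with the mean of the values one hour before and after."""
    filled = list(values)
    for t in range(1, len(values) - 1):
        before, after = values[t - 1], values[t + 1]
        if values[t] is None and before is not None and after is not None:
            filled[t] = (before + after) / 2
    return filled


# Hypothetical readings (not from the paper's data set).
raw = [("2021-02-01 02", 80.0), ("2021-02-01 03", 85.0), ("2021-02-01 03", 86.0)]
deduplicated = drop_duplicates_keep_last(raw)
series = fill_missing([78.0, None, 80.0, 86.0])
print(deduplicated)
print(series)
```

Runs of consecutive missing values are left as `None` in this sketch; how the paper treats such runs is not stated.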
Because the environmental protection department calculates AQI from six main pollution indexes (PM2.5, CO, O₃, NO₂, PM10, and SO₂), these six indexes are introduced as input items of the data set in this experiment [48][49][50]. Air quality data from 00:00 on January 1, 2017 to 23:00 on June 30, 2021 in Shijiazhuang City, Hebei Province, China are used as the experimental data set. There are 39,408 pieces of data in this data set. The data obtained after preprocessing the data in Table 2 are shown in Table 3.

Table 3. Experimental data. Columns: PM2.5 (μg/m³), CO (μg/m³), O₃ (μg/m³), NO₂ (μg/m³), PM10 (μg/m³), SO₂ (μg/m³), AQI.

Data normalization. The sample values of some features differ greatly from those of other features in the data set, which may slow convergence and reduce the training accuracy of the model. In this experiment, the original data are processed by z-score normalization, as shown in formula (24),

$$x^* = \frac{x - \bar{x}}{\sigma} \tag{24}$$

where $\sigma$ is the standard deviation of the original data, $\bar{x}$ is the mean of the original data, and $x^*$ is the value after standardization. After standardization, the data are dimensionless and scaled to the same interval. In addition, the features become comparable, and the trend and relative size of the scaled data do not change, which speeds up model convergence.
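The z-score normalization can be sketched for a single feature column. The use of the population standard deviation is an assumption, since the text does not say whether the population or sample form is used; the input values are hypothetical.

```python
import statistics


def z_score(values):
    """Standardize one feature column: subtract the mean, divide by the std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # ASSUMPTION: population standard deviation
    return [(v - mean) / std for v in values]


# Hypothetical PM2.5 readings (not from the paper's data set).
pm25 = [35.0, 80.0, 120.0, 55.0, 210.0]
scaled = z_score(pm25)
print([round(v, 3) for v in scaled])
```

Each feature column would be standardized separately, so that columns with very different magnitudes end up on a comparable scale while their ordering is preserved.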
Three-dimensional time series data construction. This experiment constructs time series by treating the time of the input data as a sequence and carrying out two-dimensional segmentation and three-dimensional construction of the input data. In Fig. 11, assuming that there are X pieces of experimental data, the data are constructed in three dimensions with step = 1 and sequence = 24. The data from the first to the 24th constitute layer Y₁, the data from the second to the 25th constitute layer Y₂, and so on, for a total of X − 23 layers (Y₁, Y₂, …, Y_{X−23}); each layer contains 24 pieces of data, which completes the three-dimensional data construction. The constructed time series data are divided into a training set, a validation set, and a test set in this experiment. The prediction model takes the first 23 pieces of data of each layer as input and the AQI value of the 24th piece of data of that layer as output for training, validation, and evaluation.
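The window construction with step = 1 and sequence = 24 can be sketched as a sliding window over the rows of the data set. The function names are hypothetical, and placing AQI in the last column of each row is an assumption made for the example.

```python
def build_windows(rows, sequence=24, step=1):
    """Slice the rows into overlapping layers of `sequence` consecutive rows."""
    last_start = len(rows) - sequence
    return [rows[s:s + sequence] for s in range(0, last_start + 1, step)]


def split_input_target(layer, target_column=-1):
    """First 23 rows as input; the AQI of the 24th row as the target.

    ASSUMPTION: AQI is stored in the last column of each row.
    """
    return layer[:-1], layer[-1][target_column]


# Toy data: 30 rows of (feature, AQI); real rows would hold six pollutants plus AQI.
rows = [[i, i * 10] for i in range(30)]
layers = build_windows(rows)
inputs, target = split_input_target(layers[0])
print(len(layers), len(inputs), target)
```

With X rows the sketch yields X − 23 layers, in line with the count given in the text.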
Data set segmentation. During model design and training, model parameters (such as the weights, the number of layers, and the size of each layer) need to be adjusted [51]. During training, the prediction performance on the validation set provides the feedback used to adjust the network model and its parameters, which is the role of the validation set. However, this feedback leaks information about the validation set into the model. The more the model is adjusted, the more information is leaked, and the more clearly the model "understands" the validation set, which eventually causes the model to over-fit the validation set. At this point, a data set that is completely "unfamiliar" to the model, the test set, is needed to measure the overall prediction performance of the model. So after presetting the model parameters, the data set is divided into a training set, a validation set, and a test set. According to experience, the candidate data volume ratios are 8:1:1, 7:2:1, 6:3:1, 7:1:2, 6:2:2, 5:3:2, 6:1:3, 5:2:3, and 4:3:3. The prediction fitting degree on the validation set for the different segmentation ratios and models is shown in Table 4. In this experiment, the prediction fitting degree of the different models is higher when the ratio of the training set, validation set, and test set is 7:2:1. Therefore, the ratio used in this experiment is 7:2:1.
Because the neural network has a strong fitting ability, if the data set is trained in chronological order and the same "batch" combinations appear repeatedly, the model may over-fit, which would distort the test of its generalization ability. Therefore, in this experiment the order of the input data is shuffled in every training, validation, and test run of the model.
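The 7:2:1 segmentation and the shuffling of the input order can be sketched as below. Splitting the samples in their stored order before shuffling within each subset is an assumption (the text does not state how samples are assigned to the three sets), and the function names and seed are hypothetical.

```python
import random


def split_dataset(samples, ratios=(7, 2, 1)):
    """Split samples into training, validation, and test sets by the given ratio."""
    total = sum(ratios)
    n_train = len(samples) * ratios[0] // total
    n_val = len(samples) * ratios[1] // total
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test


def shuffled(samples, seed=None):
    """Return a shuffled copy so the same batch combinations do not repeat."""
    rng = random.Random(seed)
    copy = list(samples)
    rng.shuffle(copy)
    return copy


# Toy usage with 100 placeholder samples.
train, val, test = split_dataset(list(range(100)))
epoch_order = shuffled(train, seed=0)
print(len(train), len(val), len(test))
```

Calling `shuffled` again with a different seed before each run gives a new input order without changing which samples belong to each subset.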

Model parameter adjustment.
In deep learning, a given machine learning algorithm has model parameters and model hyper-parameters. Model parameters are generally internal variables, such as biases and weights. These parameters are not set manually but are learned automatically from the training data. The model's hyper-parameters are set before model training and are often designed manually from the researchers' experience. Model hyper-parameters can be divided into structural hyper-parameters and running hyper-parameters. Structural hyper-parameters are configurations that play a decisive role in the model structure, such as filters, padding, and kernel_size in the convolution layer; pool_size and padding in the pooling layer; and units and kernel_initializer in the ILSTM layer. Running hyper-parameters are used to run neural networks, such as batch_size, epochs, and learning_rate. Traditional manual design of hyper-parameters is time-consuming, inefficient, and costly, and the results of models with manually designed hyper-parameters are difficult to reproduce and extend. This experiment combines the empirical approach with a hyper-parameter optimization technique (the grid search optimization algorithm) to adjust the parameters. The purpose of hyper-parameter optimization is to find a suitable set of parameters so that the model has good expression ability and generalization ability. Based on experience, the candidate values of batch_size are 110, 120, 130, 140, and 150; the candidate values of epochs are 80, 90, 100, and 110; and the candidate values of learning_rate are 0.01, 0.005, 0.002, 0.001, and 0.0009. After filters, pool_size, units, learning_rate, and the other parameters are selected by the grid search optimization algorithm, the parameters set in the final experiment are shown in Table 5.
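A grid search over the running hyper-parameters listed above can be sketched with `itertools.product`. The candidate values are those given in the text; the `evaluate` callable is hypothetical and stands in for training a model and returning its validation score, and the toy scoring function in the usage lines is purely illustrative.

```python
from itertools import product

# Candidate values taken from the text.
GRID = {
    "batch_size": [110, 120, 130, 140, 150],
    "epochs": [80, 90, 100, 110],
    "learning_rate": [0.01, 0.005, 0.002, 0.001, 0.0009],
}


def grid_search(evaluate, grid=GRID):
    """Try every combination and keep the one with the highest score.

    `evaluate` is a hypothetical callable: it receives a dict of
    hyper-parameters and returns a validation score (higher is better).
    """
    names = list(grid)
    best_params, best_score = None, None
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if best_score is None or score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy scoring function (an assumption for illustration, not a trained model).
def toy_score(p):
    return -abs(p["batch_size"] - 130) - abs(p["epochs"] - 100)


best, score = grid_search(toy_score)
print(best, score)
```

The grid above contains 5 × 4 × 5 = 100 combinations, so each additional hyper-parameter multiplies the number of training runs; this is why the candidate lists are kept short.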

Experiment analysis. Model convergence.
After the model is built and the parameters are set, it is necessary to verify whether CNN-ILSTM converges normally during training. With all parameters the same and epoch = 100, the 1-0 loss function is used in this experiment to show convergence. The convergence of CNN-ILSTM, CNN-GRU, and CNN-LSTM is shown in Fig. 12. The loss of CNN-ILSTM is smaller than that of CNN-LSTM before the 10th training epoch, so CNN-ILSTM converges faster than CNN-LSTM in this experiment. The loss of CNN-ILSTM is smaller than that of CNN-GRU before the 5th training epoch, so CNN-ILSTM converges faster than CNN-GRU in this experiment.

Table 4. Fitting degree of different prediction models in segmentation of data sets with different ratios.

The prediction results are evaluated by MAE, MSE, and R², as shown in formulas (25) to (27):

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|\hat{y}_i - y_i\right| \tag{25}$$

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 \tag{26}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{m}\left(\bar{y} - y_i\right)^2} \tag{27}$$

where $m$ is the number of data in the test set; $\hat{y}_i$ is the predicted value; $y_i$ is the true value; and $\bar{y}$ is the average value of the true values.
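The three evaluation indexes can be computed as in the sketch below, which assumes the standard definitions of MAE, MSE, and R² in terms of the predicted values, the true values, and the mean of the true values. The sample values are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical AQI values (not results from the paper).
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 200.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), r2(y_true, y_pred))
```

Lower MAE and MSE and an R² closer to 1 indicate a better fit, which is how the comparison between models is read.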
Experimental results. To verify the accuracy of CNN-ILSTM in predicting AQI, traditional regression models (SVR, RFR, and MLP), recurrent neural network models based on gated technology (LSTM, GRU, and ILSTM), and hybrid prediction models (CNN-LSTM and CNN-GRU) are introduced as comparison models. The experimental results are shown in Table 6. The evaluation results on the test set show that the traditional regression models SVR, RFR, and MLP have a lower prediction fitting degree than the recurrent neural network models based on gated technology. The R² of LSTM is 0.0697 higher than that of SVR, 0.0542 higher than that of RFR, and 0.0341 higher than that of MLP. The predicted and true values of SVR, RFR, MLP, and CNN-ILSTM are shown in Fig. 13. … Fig. 16.
Discussion. In the experiment using this test set, the CNN-ILSTM AQI prediction model performs better than the other models on the overall evaluation indexes. Compared with the traditional regression models, the recurrent neural network models based on gated technology have a better prediction fitting degree. Compared with LSTM and GRU, ILSTM significantly reduces the training time, owing to its smaller number of parameters, while maintaining higher prediction accuracy. Compared with ILSTM alone, CNN-ILSTM improves the prediction accuracy through the introduction of CNN. Compared with CNN-LSTM and CNN-GRU, CNN-ILSTM is better in both prediction accuracy and training time.
The reasons for the improvement of CNN-ILSTM's AQI prediction accuracy are as follows:

Conclusions
This paper presents an AQI prediction model based on CNN-ILSTM. Compared with the traditional regression models SVR, RFR, and MLP, and the deep learning models LSTM, GRU, ILSTM, CNN-LSTM, and CNN-GRU, CNN-ILSTM achieves the best overall evaluation of prediction results. ILSTM is proposed for the first time. ILSTM is improved and optimized in model design and parameter ratio on the premise of high prediction accuracy and of alleviating the "gradient explosion" and "gradient disappearance" issues caused by long-term data dependence in the RNN. Compared with LSTM and GRU, the training time of ILSTM is reduced by 48.6% and 10.34%, respectively, and ILSTM has the best AQI prediction results of the three. In addition, the introduction of CNN makes up for the deficiency of ILSTM in feature extraction and learning. The experimental results show that, compared with ILSTM AQI prediction, the MAE of CNN-ILSTM decreases by 0.284798 and the R² increases by 0.013951. The conclusions of this paper are as follows:
1. ILSTM performs better than LSTM in this experiment. ILSTM is an improvement of LSTM, which deletes the output gate of LSTM, improves its input gate and forget gate, and introduces a CIM to prevent supersaturation in the learning process. On the premise that the model alleviates the "gradient explosion" and "gradient disappearance" issues of the RNN and has high prediction accuracy, ILSTM significantly reduces the training time compared with LSTM and GRU.
2. The AQI prediction model of CNN-ILSTM makes up for the shortcomings of a single prediction model, such as insufficient feature extraction and insufficient learning of historical data. In this experiment, the AQI prediction model of CNN-ILSTM is the best.
3. The model design and parameter tuning are improved and optimized, so the convergence rate of the AQI prediction model based on CNN-ILSTM is improved.
However, the AQI prediction model of CNN-ILSTM does not perform well in extreme value prediction. Therefore, future research will address the high-precision prediction of extreme values.